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Abstract 

This paper introduces a general framework of covariance structures that can be 
verified in many popular statistical models, such as factor and random effect models. 
The new structure is a summation of low rank and sparse matrices. We propose a LOw 
Rank and sparsE Covariance estimator (LOREC) to exploit this general structure 
in the high-dimensional setting. Analysis of this estimator shows that it recovers 
exactly the rank and support of the two components respectively. Convergence rates 
under various norms are also presented. The estimator is computed efficiently using 
convex optimization. We propose an iterative algorithm, based on Nesterov's method, 
to solve the optimization criterion. The algorithm is shown to produce a solution 
within 0(t^^) of the optimal, after any finite t iterations. Numerical performance is 
illustrated using simulated data and stock portfolio selection on S&P 100. 

Keywords: lasso, nuclear norm, sparse, low rank, covariance matrix, factor model, random 
effect. 

AMS 2010 Subject Classification: Primary 62H12, 90C25; secondary 62G05, 68W40. 



*Xi Luo is Assistant Professor, Department of Biostatistics and Center of Statistical Sciences, Brown 
University, Providence, RI 02912. Email: xluo@stat.brown.edu. 

Part of the research was done while the author was a visiting lecturer and postdoc at The Wharton 
School, University of Pennsylvania, Philadelphia, PA 19104. The author would like to thank Professor 
Tony Cai, Dorothy Silberberg Professor, Department of Statistics, The Wharton School, University of 
Pennsylvania, for his valuable mentoring and many helpful suggestions. 



1 



1 Introduction 



Covariance estimation is fundamental in multivariate analysis. They has been widely used 
to model gene microarrays, fMRI, financial and climate time series, among many other 
important applications. The natural estimator, sample covariance, is known to perform 
badly in high dimensions, where the number of variables is larger than or comparable 
to the sample size. To stabilize the estimator, past research efforts mostly rooted on a 
sparse or banded covariance structure. Many popular statistical models, however, entails 
a different covariance structure than these cases considered. In this paper, we propose a 
decomposition approach, low rank plus sparse, to address this challenge. 

To begin with, suppose we observe Xi, . . . , X„ G iid from a multivariate Gaussian 
distribution N{n, S*). The canonical estimator for S* is the sample covariance 

n 

s„ = - X)(X, - X)"^ 

i=l 

where the sample mean X = (1/n) Y17=i-^i- Sample covariance and correlation matrices 
are illustrated in Figure [T] for 53 monthly stock returns from 1972-2009. Details about this 
dataset can be found in Section I5l2l Note that 86.14% of the correlations are larger than 
0.2, more than 50% are larger than 0.3. 

Figure 1: Sample covariance and correlation of monthly returns of SPlOO from 1972-2009. 
(a) Sample Covariance (b) Sample Correlation 




Regularizing S„ has been proposed in the literature t o stabilize the estiniator, u nder 
various structural assumptions. In the time series setting, iBickel and Levinal (l2008al) pro 



posed banding S„. The optimal convergence rate of banding was established in ICai et al. 
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(j2010bl ). IWu and Pourahmadil (120031 ): iHuang et al.l (120061) employed regularized regres- 
sion based on modified Cholesky decomposition. In a setting whe r e the indexing order 
is unavailable, thresholding S„ was pro posed by iBickel and Levinal ( l2008bl ). Other types 



ding rules were co nsidered in 



Rothman et al. 



Cai and Liu 



( I2OO9I ). and adapt iveness was ad- 



of thresho 
dressed in 

ber of nonzero off-diagonal entries is small comparing to the sample size. Without imposing 



(120111 ). There and in what follows, sparsity roughly means the num- 



such e xplicit structural assumptions, shrinkage estimation was proposed in lLedoit and Wolf 
(I2OO4J ). Note that banding, thresholding and shrinkage estimators can be obtained using 
elementwise operations, simply treating a matrix as a collection of scalars. This paper also 
studies the order invariant situation, and it seems to be unnatural to apply thresholding 
directly for our dataset in Figure (H partly because most of the entries have large mag- 
nitude. However, thresholding may still be reasonable after removing some other effects. 
We consider this could be due to a hidden factor for this dataset. In short, we pursue to 
decompose the covariance matrix into two regularized components. 

Indeed, dimension deduction through matrix decomposition has been employed for a 
long history. Classical examples in clude princ i pal c omponent analysis (PCA) and factor 
analysis, see the standard reference lAndersonI (119841 ). The main idea is that multivariate 
data can be explained using a few components after decomposition. These methods have 
enjoyed numerical successes in practice, but mostly in studying X. Moreover, the implica- 
tions of these models to covariance estimation were unclear yet. Motivated by these, and 
other specific models to be explained momentarily, we propose a unified framework that 
entails the following covariance structure 



L* + S* 



(1) 



where L* is a low rank component and S* is a sparse component. Examples fall within this 
framework include the following. 



1. Factor analysis: consider the following concrete model 



X = Bf + e 



(2) 



where B is a fixed matrix in MP^^ , factor f G M'^ and error e G are independently 
gene rated by some r andom processes. Factor m odel ( Ml was widely studied in statis- 



1983 



tics ( Andersonl.ll984j). in econo mics and finance (IRoss 



1976 



Fama and French 



Chamberlain and Rothschildl . 



19921 ). in signal processing ( iKrim and Vibergl . Il996l ). The 
number of factors, K, is usually ass umed to be sm all, for example = 1 in the 
capital asset pricing model (CAPM) (ISharpd . Il964] ) and K = 3 in the Fama-French 
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model ( iFama and Frenchl . 119921 ) . It is easy to see that the imphed covariance matrix 
of model ([2]) is 

S* = BVar(f)B'^ + (3) 

where the error covariance 5]^ = Eee^. Note t hat S* — has at most rank K. Us- 

proposed a regression type estimator, 



Fan et al. 



ing the above decomposition, 
assuming f is observable (thus K is known) and is diagonal. First this paper 
considers a hid den f setting. Mor e over, w e will assume in this paper that is sparse 
in the spirit of iBickel and Levinal (j2008bl ). i.e. the errors could be correlated at a few 
places. This is because non- diagonal structures have also b e en co nsidered appropriate 
in modeling stock data, see IChamberlain and Rothschild! ( 1l983l ) for example. More 



Fan et al 



20081 ). our approach shows an im- 



importantly, in the simplified setting of 
proved convergence rate 0{n~^^'^{p + p^/'^K^/'^) under the Frobenius norm, compared 
with their rate 0{n~^/'^pK). Additional improvements will be discussed in Section 
13.21 For example, our method can also recover the exact rank and sparsity pattern. It 
serves as a important step towards adapting to the factor model order and modeling 
possible correlated errors when the factors are unobservable. 

2. Random effect: modeling the covariance of repeated measures is an important topic 
in many contexts, for example longitudinal data analysis. Take the Compound Sym- 
metry model below as an illustrating case, which can also be theoretically justified 
under certain experimental designs. 



(4) 



where o"^ and S* are between-subject and within-subject variances respectively, and 
usually S* = cr^I. First, some other random effect models also admit such a two com- 
ponent decomposition, see examples in 



Fitzmaurice et al 



(|2004[). Mor e over, other 



types of S* have been considered reasonable. iJennrich and Schluchterl (jl986[ ) pro- 
posed an AR(1) type component in the time-series setting. In a similar rationale as 
before, we propose S* to be sparse for non-ordered variables. We will also assume 
S* to be positive definite to ensure the positivity of the resulting covariance. In this 
setting, our method can also recover the support of S*. 

3. Spiked covariance: To s tudy certain type of time series data (ECG for example). 



Johnstone and Lul (120091 ) proposed a spiked covariance model 

S* = /3uu^ + S* 



(5) 
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where most of the coordinates of u are zero. Previous results mostly assumed S* = 
o"^I. Inconsistency of classical PCA on esti mating u was identified, and th ey proposed 
a consistent approach called sparse PCA. Amini and Wainwright ( 20091 ) established 
consistent subset selection on the nonzero coordinates of u, under the scaling n > 
Ck^ \og{p—k)^, where k is the number of the nonzero components in u. In this setting, 
our estimator (with a simple modification) achieves the same rate as theirs, but we 
can allow the second component S* to be sparse and non-diagonal. In addition, our 
method can establish improved matrix loss results in this setting, see Section 13.41 
An extension of this current frame may include multiple u's, and we will report the 
results elsewhere. 

4. Conditional covariance: estimating the conditional covariance is essential for finan- 
cial mode l ing an d pricing, probably popularized by the GARCH model. Following 



AndersonI (jl984j ). suppose the multivariate normal vector X = (Xf , X2 )^, condition- 



ing on the subvector X2, the conditional covariance of the subvector Xi is 



where Sjj, for i,j = 1,2, is standard matrix partition of the full covariance. It may 
be reasonable to assume that the marginal covariance Sn is sparse as before, and 
then the first component again can have rank equal or less than the length of X2. In 
contrast to other model-based methods in the finance literature, our method provides 
a non-model based approach with statistical performance guarantees. 

In summary, our general framework ([T]) is less model dependent, but also include many 
interesting model-driven cases. Certainly, the above list is no way complete. 

Our approach also diffe rs, from classica l low rank approximation methods. For example, 



low r ank approximation ( iZhang and Wu 



2003 



Berge and Kiers 



1991 



Malioutov et al 



20061 ) to the covariance does not usually provide a full rank covariance estimator, neither 
does PCA. Spectral norm results in Section [3] provide assurance that the resulting estimator 
is full rank with high probability. In there, we also show that the two components are 
recovered exactly, and its implications will be discussed in the model settings listed before. 

To exploit the low rank and sparse structure in ([T]), we propose a penalized approxima- 
tion criterion using a composite penalty, called LOw Rank and sparsE Covariance estimator 
(LOREC). A simple approximation criterion using the Frobenius norm is proposed, and 
it enjoys several advantages. To r nake our ob j ective computationally attractive, a com- 
posite penalty of a nuclear norm ( iFazel et al.l . I2OOII ) and an ii norm ( iTibshirani 119961 ) 



^The power in k can be reduced to 1 if additionally assuming small k = 0{log{p)). 
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is used to enforce the low rank and sparse properties of the two co mponents. T. 
type of penalty was employ ed under different settings before, see 



Candes et al 



Chandrasekaran et al. 



nesame 

(120101 ) for example. In particular, the latter established theoreti- 
cal properties of a penalized likelihood estimation of the inverse covariance, due to latent 
variables. Though the proof followed their analysis techniques with modifications for our 
objective, we show improved convergence rates in some cases because this problem is very 
different. For example, in the worse case scenario (such as compound symmetry), LOREC 
achieves an improved rate 0(n~^/^p^/^r~^/^ max(logp, r)^/^) for the low rank component 
under the spectral norm, comparing with their suboptimal rate 0{n~^^'^pr~^^'^). This rate 
is further improved under the spike covariance model, to be 0{{k + s)n~^^'^(logpY^'^) where 
S* is s sparse, see definition in Section 13.11 We achieve also some other improvements in 
assumptions due to the simple form of our non-likelihood loss. Last but not the least, this 
paper also provides additional algorithmic advantages which now we turn to. 

Computation of our method c an be efficiently carried out using our iterativ e algorithm. 



1983 



Nesterov and Nestero' 



H2004h 



for smooth 



It is based on Nesterov's method (iNesterovl . 
objective functions. We adapt a recent development and analysis by Ji and Yg ( 2009 ) to 
our composite non-smooth penalty framework. For our non-smooth objective function, 
our algorithm is shown to achieve the optimal iteration complexity, among all first order 
methods for smooth optimization problems. Each iterative step can be explicitly formed in 
closed form, and it does not require excessive memory storage. Therefore, It is attractive 
for large-scale problems. 

This paper is organized as follows. In Section |2l we introduce our problem set up and our 
method based on convex optimization. Its statistical loss bounds are presented in Section 
El Our algorithmic developments are discussed in Section HI and its complexity analysis 
are established therein. The numerical performance is illustrated in Section [5l by both 
simulation examples and a real data set. All proofs are delegated to Section [6l Possible 
directions are finally discussed in Section [71 All convergence and complexity results are 
non-asymptotic. 

Notations Let M = (Mij) be any matrix. The following matrix norms on M will be 
used: |M|i = Yli^jl^ijl stands for the elementwise £i norm, i.e. the sum of absolute 



values of all matrix entries; I Ml 



maxj J- 1 Mj 



I for the elementwise 
IMIf = 



norm, i.e. the 



Si Ylj ^ij the Frobenius 



maximum absolute values of all matrix entries; 
norm; ||M||^, = Da for the nuclear (trace) norm if the SVD decomposition M = UDV^, 
i.e. the sum of singular values. We denote (A, B) := trace(A-^B) for matrices A and B of 
proper sizes. We also denote the vectorization of a matrix M by vec(M). Generic constants 
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are denoted by C, Ci, C2, . . . , and they may represent different values at each appearance. 



2 Methodology 

We introduce the LOREC estimator in this section. Existing approaches that motivate 
LOREC will be brieffy reviewed first. Other variants and extensions of LOREC will dis- 
cussed in Section [71 

Given a sample covariance matrix low rank approximation seeks a matrix M that 
is equivalent to optimize 



mm 



|^|M-S„|^ + Arank(M)| (6) 



for some penalization parameter A > 0. The multiplier 1/2 before the Frobenius norm is 
not essential, and is chosen here for simplicit y in analysis. This estima tor is known as low 



rank approximation in matrix computation ( IHorn and Johnsonl . Il994j ). A variant of the 
ob jective function with an additional constraint that diag(M) = diag(S„) was proposed 
by IZhang and Wul (|2003[ ). Note that can be solved using the r largest eigenvalues and 
eigenvectors from the eigen decomposition of where r is determined by A. One major 
problem of using ([6]) to estimate S* is that the resulting estimator is usu ally rank deficient. 



Recently, elementwise thresholding procedures were introduced by 



Bickel and Levina 



( l2008bl ). The hard thresholding variant is equivalent to the following optimization 



mm 




J]pen,(M,,)l (7) 



where the hard thresholding penalty is 2pen (x) = — (\ x\ — < p\ for some 



thresholding level p > 0. Other penalty forms were studied in lRothman et al.l (120091 ). One 



question of using such a procedure is that it treats matrix elements like scalars, and its 
underlying sparsity assumption is yet justifiable under some classical multivariate models, 
as illustrated in Section [TJ On the other hand, most variations of the data do not always 
follow a single variable direction, but along their singular vectors, as in principal component 
analysis. More importantly, many data examples suggest a non-sparse covariance structure 
as well, as in Figure [H 

To address these concerns, we let S* to be decomposable as a low rank matrix plus a 
sparse matrix in ([1]). Adding a sparse matrix (possibly positive definite) naturally over- 
come the shortcoming of using a low rank matrix (possibly positive semi-definite). This 
nonparametric modeling approach is less model dependent, and thus can be applied in 
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a broad range of applications. The identifiability issue of such a decomposition will be 
addressed in Section [31 

Under our structural decomposition, it is natural to consider a two-component estimator 
S = L + S, with a low rank matrix L and a sparse matrix S. To enforce such properties 
of these two components, one simple idea is to combine the two penalty functions from 
and ([7]), and it leads to the following objective function 



mm < - |L + S — Tinfp + Arank(L) + ^^penp(S'j. 



L,S 



(8) 



However, 



is computationally intractable due to the composite penalty, and it is a 



com binatorial optimization in genera. 



ces (iTibshirani , 11996 



Friedman et al. 



Ado pting the ii norm heuristic for sparse matri- 



20081 ). and the nuclear norm for low rank matrices 



( iFazel et al.l . l200l[ ). we consider replacing the thresholding and rank penalty terms by the 
ii norm and nuclear norm penalties respectively. Thus we propose the LOREC estimator 
S = L + S via the following optimization: 



mm 

L,S 



|L + S-S„|; + A||L 



P|S| 



(9) 



where A > and p > are penalization parameters. This objective is convex and can be 
solved efficiently, with proven global iterative complexity bounds, using the algorithm in 
Section |H Note that the symmetry of (L, S) is always guaranteed because the objective 
is invariant under transposition. Moreover, the symmetry is also preserved numerically as 
the iterative algorithm always produce symmetric updates from symmetric initializations. 

The objective function is non-likelihood based, where the Frobenius norm loss is used. 
Firstly, this loss is a natural extension of the mean squared loss for vectors. Indeed it is the 
£2 norm of either eigenvalues or matrix elements, and thus is a natural distance for both 
spaces where these two components live within. The second reason is that this provides a 
model-free method, as covariance is also model-free. This approach can produce a robust 
estimator even the likelihood function could be misspecified. If the multivariate Gaussian 



assumption holds, advantag es of such non-likelihood met 



10 ds have been observed in other 



high dimensional problems ( iMeinshausen and Biihlmann 



2006 



Cai et al. 



20111). Finally, 



the Frobenius norm loss provides an advantage of fast computation. 

The sample covariance is used in our construction (Q because of convenience. Any 
other estimator close to the underlying covariance can be used instead. Improvement can 
be achieved in some cases with such replacements. For instance, we illustrate this in Section 
13.41 by replacing with thresholded sample covariance. There are other variants of penalties 
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that can be used in (I9l). For example, the diagonal can be unpenalized by replacing the ^l 
norm by 15*11^0 = '^i^j\Sij\. Other possibilities will be discussed in Section [71 



3 Statistical Performance 

We show our main results regarding the performance of LOREC first, without assuming 
any specific parametric structure. We then discuss the implications of our main results 
under some special cases. In particular, we will consider the factor, compound symmetry, 
and spiked covariance models, and compare our results with existing ones. The conditional 
covariance case is omitted due to its similar conclusion to the factor model case. 



3.1 Main Results 

For the ident i fiabili ty of L* and S*, we ne e d som e additional assumptions. Following 



Candes et al 



(120091); 



Chandrasekaran et al. 



( 120101 ). we need these definitions. For any 



matrix M G M^^^, define the following tangent spaces 

fi(M) = {N e RP^P I support(N) C support(M)} , 
T(M) = {UYf + YaV^ | Yi, Y2 G W''] 

where the SVD decomposition M = UDV^ with U G W\ V G W"''' and diagonal 
D G . In addition, define the measures of coherence between r2(M) and T(M) by 



e(r(M)) 

/x(fi(M)) 



max 



INI 



NeT{M), ||N||2<1 

max II N| 

NeC(M), |N|^<1 



2 • 



Detailed discussions of the ab ove quantities l^fM), T f M), £ '(T(M')') and /i(fi(M)) and their 
implications can be found in IChandrasekaran et al.l (120 lOf ). We will give explicit form of 



these quantities under specific cases, see Section 13.31 and 13.41 

To characterize the convergence of to S*, we assume S* to be within the following 
matrix class 

W(eo) = {M G : < eo < Ai(M) < e^^ < +00, fori = 1, . . . ,p} 

where Aj(M) denotes the ith largest singular value of M. The largest eigenvalue of 5]* is 
then Ajnax •= Ai(S*), and the smallest is Amin := Ap(S*). Similar covariance classes (with 
additional constraints) were considered in the literature, either in the bandable setting 
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(Bickel and Levina 


. 20 


18a: 


Cai et al. 


. 2010b) 


2008bl: 


Cai et al. 


,2 


Oil 


Cai and Liu. 


2011 


)• 



J20ipb|) or in the sparse setting (IBickel and Levina 



We have the following general theorem on the statistical performance of LOREC. 

n{S*) and T = T{L*). Suppose that S* G U{eo), /i(ri)^(T) < 1/54, 



Theorem 1. Let Vl 

and for n > p that 



X = Ci max 




and p = 7A where 7 G [9^(T), l/{6fi{Q))] . In addition, suppose the minimal singular value 
of L* is greater than C2A/^^(T), and the smallest nonzero entries of S* is greater than 
C3X/ . Then with probability greater than 1 — Cip~^^, the LOREC estimator {L,S) 
recovers the rank of L* and the sparsity pattern of S* exactly, that is 



rank(L) = rank(L*) and sign(S) = sign(S*). 

Moreover, with probability greater than 1 — C^p''"'^ , the losses are bounded as follows 



L-V 



< CX and 



S-S' 



< Cp. 



Theorem [T] establishes the rank and sparsity recovery results as well as matrix losses. 
The first fact implies that LOREC can model the data without assuming the rank and 
sparsity pattern, but rather recover those from the data. An detailed explanation of this 
merit is given for factor models in Section 13.21 Note the theoretical choice of A and p are 
suggested here, and we will use cross-validation to pick them in numerical examples. 

The convergence rate for the low rank component is equivalent (up to a \ogp factor) to 
the minimax rate of the low rank regression. The Frobenius norm of L — L* is bounded by 



a rate of n ^/^p^/^ max 



( IChandrasekaran et aL 



y, logp )^/^, where ^(T) above is replaced by the lower bound \/rJp 
2010). In the low rank reg ression setting, a similar rate n^^/^(pr)^/^ 



is minimax optimal fiRohde and Tsvbakovl. l2010f). a nd the additional log(p) factor in our 
rate is usually unavoidable ( Foster and Georgel . ll994 ). We leave the justification of minimax 
optimality to future research. 

The quantities yu(fi) and .^(T) can be explicitly determined if certain low rank and 
sparse structures are assumed. We use the compound symmetry model to illustrate this 
point in Section 13.31 In other models, they can be c alculated or bounded using tools 
from 



Candes et al. 



(120091 ): 



Chandrasekaran et al. 



(120 lOh . The quantity C{T) play s a les s 



important role than in the inverse covariance problem by IChandrasekaran et al.l ( 120101 ). 
because the problem is different. In particular, we have a smaller sample size require- 
ment 0{p) than their 0{p/C,^{T)) {0{p^/r'^) in the worst case). Compare with robust 
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PCA ljCandes et all 120091 ) . which studies noiseless observations, Theorem [T] provide addi- 
tional matrix loss bounds in the noise present setting. 

The assumption n > p is required for technical reasons in order to produce finite sample 



bounds. As we show in Section 15. for all large p > n cases, LOREC performs almost 
uniformly better than other competing methods. This requirement is due to the sample 
covariance used as input in If better estimates are available under stronger assumptions, 
this requirement can be dropped, and even a better rate can be obtained, as we illustrate 
in Section [331 

The convergence of L+S to S* can be obtained using triangular inequalities on the rates 
in Theorem [TJ For instance, we show the spectral norm result after introducing additional 
notations. Let s = max,- 7^0}, which is the maximum number of nonzero entries 

for each column. An iq ball generalization of s can be employed as in 



Cai et al. 



(1201 ih . but 



here we use the io ball to present the main ideas. For the same reason, analysis of other 
types of distributions in that paper is omitted. 

Corollary 2. Under the conditions given in TheoremUl Then the covariance matrix L + S 
produced by LOREC satisfies the following spectral norm bound with probability larger than 
1-C,p~^\ 



L + S 



< C(s^(T) + l)max 




(10) 



Moreover, with the same probability, the Frobenius norm bound holds as 



L + S-S' 



< C(^^(T) + y/r) max , 




'11^ 



The spectral norm bound (fTOj) implies that the composite estimator S = L+S is positive 
definite if Amin is larger than the upper bound. Si nce the individual e igenv alue distance 
are bounded by the spectral norm, see (3.5.32) from lHorn and JohnsonI ( 119941 ). (|T0|) can be 
replaced by the following equation without any other modification of the theorem 



max 



Ai(L + S) - A,(S*) < C(s^(T) + 1) max 



1 /logp [p 



C,{T) V n \ n 



(12) 



This eigenvalue convergence bound further implies the following result concerning the in- 
verse covariance matrix estimation. 
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Corollary 3. Under the conditions given in TheoremUl Additionally, assume that A^^m ^ 
2$ where $ is defined in fll2p . Then the inverse covariance matrix (L + S)^^ produced by 
LOREC satisfies the following spectral norm hound with probability larger than 1 — Cip~^'^ , 



1* \ — 1 



< C(s^(T) + 1) max 



1 /logp [p 



S,{T) V n \ n 

Moreover, with the same probability, the Frobenius norm bound holds as 



x + s 



,-1 



(S*)-^ ^ < C(v^^(T) + v/f) max ( ^ ^^^^^ 




(13) 



(14) 



This shows that the same rates hold for the inverse of the LOREC covariance estimate. 
The spectral norm bound ( fT3l) is useful when the inverse covariance is the actual input 
of certain methods, such as linear /quadratic discriminant analysis, weighted least squares, 



and Markowitz portfolio selection. In Section 15. 2[ we demonstrate the last method on S&P 
100. 



3.2 Implications under Factor Models 



Factor models has been widely employed in studying financial data among many others. 
When applying a factor model (jj]) to data, the following are three important questions that 
LOREC attempts to address. 

1. How many factors should we employ in the model? For example, whether CAPM 
(1-factor) or Fama- French (3-factor) should be used? 

2. How to deal with the situation that factors are unobservable? If certain proxies to 
the factors are available, how do we justify them? 

3. What if the errors are correlated? 

For question 1, we further assume a non-degenerate loading form such that r = rank(B Var(f )B^) 
rank(Var(f)) = K, which holds if B has orthonormal columns for example. Theorem [1] im- 
plies that LOREC can identify the exact rank of B Var(f )B^, and it thus recovers the exact 
number of factors K. In Section 15.21 we apply LOREC to SPlOO stock return data, and 
the result supports the theory of C APM for our da taset, which was independently verified 
on a similar dataset by a rank test ( lOnatskil . 120091 ). 

For question 2, LOREC does not require observing f, where as other approach es may 
have such a requirement, for example in the regression method (iFan et al.l . 120081 ). More 
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importantly, the loading of a single factor can be partially recovered by LOREC. For 
example, the eigenvector associated with the single nonzero eigenvalue is proportional to 
the loading vector B. Indeed, LOREC identifies a loading that is almost identical the 



loading found by CAPM in Section 15. 2[ which employs the market return as a proxy. 

Question 3 entails more flexibility of the modeling, whereas m o st of factor models 
assume a diagonal covariance for t he error component (iFan et al.l. 120081) . More gen- 



eral structure has been consi dered in 
sparse s tructure in the spirit of lBickel~and Levinal (j2008bf ); 



Chamberlain and Rothschildl (Il983[ ) . We consider a 



Rothman et al. 



(120091); 



Cai et al 



( l2010bl ). Numerical results in Section 15.21 indeed shows an interesting pattern of non- 
diagonal entries. 

Even though LOREC can deal with a very general factor mo del setting , for a fair 



comparison, we now restrict our findings to the setting assumed in iFan et al.l (120081 ). and 
compare with their results. They assume S* = is diagonal, and consequently we set 
s = 1. Compared with the rates in theirs, LOREC has the following improvements. 

1. Under the Frobenius norm, they showed both the sample covariance and their esti- 
mator have the same rate 0{n~^^'^pK). On the other hand, LOREC has an improved 
rate 0{n^^^'^{p + p^^'^K^^'^)). A better rate will be obtained in a moment under spike 
covariance in Section 13.41 where additional information of L* is available. 



LOREC shows a eigenvalue convergence rate of 0(n"^/^p^/^) from f fT2l) . which also 
improves their corresponding rate 0{n~^^^pK). Moreover, LOREC can also provide 
spectral loss bounds. 

The inverse covariance estimation also entails some improvement. For instance, under 
the Frobenius norm, the convergence rate 0{n~^^'^{p + p^^'^K^^'^)) improves the rate 
0{n~^^'^pK'^ log^^'^ p) obtained by theirs. 



3.3 Implications under Compound Symmetry 

The parameters ^(T) and /i(fi) in Section [XT] can be expressed explicitly in terms of n and 
p under compound symmetry. Recall the model (jl]) 

S* = a^l'^l + S*. 

In this setting, it is then easy to verify that C,{T) = 2/{y/p + 1) < 2/^/p and /i(fi) < s. 
Consequently, we have the following corollary by substitute these expressions into Theorem 

m 
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Corollary 4. Assume the general compound symmetry model (jlj). Suppose that 5]* G 
W(eo), and n > p > Cqs"^ . Let 



A = Ci 



plogp 



and 



logp 



n V n 

In addition, suppose that > C3A and the smallest nonzero entries of S* is greater than 
C^X/s. Then with probability greater than 1 — C^p^'^'^ , the following conclusions hold for 
LOREC estimator (L, S); 

1. rank(L) = 1; 

2. sign(S) = sign(S*); 



3. 



S-S* 



< Cn-i/2(logp)i/2. 



This Corollary shows that LOREC recovers the support of S* exactl y. The element- wise 
convergence rate in Conclu sion 4 above is consistent with exiting results (IBickel and Levinal . 



2nn8b 



Cai and Liu 



20111). 



3.4 Implications under Spiked Covariance 

Recall the general spiked covariance model 



I3uu^ + S* 



where 7^ if z G © and otherwise. Following lAmini and Wainwrightl ( 120091 ). we 
assume Ui G {+1,-1} / \/k if i G ©, and otherwise. Let the sparsity |©| = k. The 
underlying covariance mat rix is thus /c-sparse, i.e. at most k nonzero elements for each row. 
Bickel and Levinal (j2008bl ) showed that thresholding the sample covariance will produce an 
estimator with a convergence rate O {kn~^^'^ log^^'^ p) under the operator norm. LOREC 
can then achieve better convergence rates if S„ in (fT5|) is replaced by its hard thresholded 
version Hth in the following optimization 



mm 

L,S 



where IT, 



th\ij 



|L + S-Et/,|; + A||L||, + p|S|i| (15) 
[S„]jjl > t}. The threshold r is shown to be of the order 



\/ (log p)/n, and can be chosen adaptively using ICai and Liul ( 201ll ). As before, it can 
be verified that ^{T) < 2/\/k and /i(fi) = s in this special case. 
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Corollary 5. Assume the general spike covariance model ([5]). Suppose that S* G U{eo), 
n > Cok"^ logp, and k > Cis"^. Let 



A = C2 (fc + s) 



logp 



P 



C3 (v^ + v^) 



logp 



n \ / \ n 

In addition, suppose that (5 > C^kX and the smallest nonzero entries of S* is greater than 
C5A/S. Then with probability greater than 1 — Cgp^*"^, the following conclusions hold for 
the LOREC estimator {L,S) with the input l^th in (flSl) ; 

1. rank(L) = 1; 

2. sign(S) = sign(S*); 



3. 



L — /3uu 



S-S* 



T 



<C{k + s)n~^/\\ogp) 



1/2. 



n 



-1/2 



(logp) 



1/2 



Due to the replacement S^/j, the assumption n > p is weakene d hy n > Ck^ log p here 



when k grows not too fast (say, smaller than p^ with ^ < 1). lAmini and Wainwright 
( l2009h showe d that this is exac t ly the lower bound for recovering the support of u using 
sparse PCA ( 1 Johnstone and Lu . One additional thresholding step is needed for our 

method to recover the support of u, which is a consequence of Conclusion 3 above. First 
by the sin0 theorem of Davis-Kahan ( Davis and Kahan . 197o[ l. the spectral bound above 
implies that ||uu^ — uu^|| < 1/ {2k) for sufficiently large C4, where u is the eigenvector of 
L associated with the single non-zero eigenvalue. Therefore, thresholding uu-^ at the level 
1/ {2k) will recover the exact support of uu^, and consequently it will recover the support 
of u. 



4 Algorithms and Complexity 



We introduce our efficient algorithms for solving LOREC in this section. Our method is 
based on Nesterov's first order method, and we obtain explicit bounds on the iteration 
complexity. 



Nesterov's methods (INesterovl . 119831 ) have been widely used for smooth convex opti- 
mization ^£roblems.^_j^ai]__be_a£c^^ computational convergence rate of 

where t is the number of iterations. 



0(1 /t'^) (INesterovl. 



1983 



Nesterov and Nesterov. 



2004 



Nemirovsky and YudinI (1l983l ): iNesterov and Nesterov! (12004J ) showed that this is the opti- 



mal convergence rate under the first-order black box model, where only the function values 
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and the first-order gradients are used. This method also enjoys the advantages of frugal 
memory usage and fast empirical convergence, which are crucial for large-scale problems. 
We will adopt this method to solve our objective function. 

The LOREC objective in (fT5|) is, however, non-smooth, due to the nonsmooth penalties- 
the nuclea r norm and ^^ norm. The subgradient method t end t o be slow, as it converges as 



0{l/\/t) (IBertsekas et a. 



19991: 



Nesterov and Nesterov 



20041 ). We thus modify acceler- 



ated gradient algorithm (jji and Yd . l2009[ ) to solve the LOREC criterion with the composite 



penalty. One key idea is that we separate the two penalties into two easy-to-solve opti- 
mizations. Numerically, each iteration of our algorithm can be formed by simple algebraic 
operations. It is thus suitable for large-scale problems. This algorithm is also proved to 
enjoy the optimal convergence rate 0(l/t^). 



We now introduce our algorithms. Let /(L, S) 
respect to either L or S is of the same form 



|L + S 



-'71 I p 



/2. The gradient with 



vec(VL/) = vec(Vs/) = vec(L + S - S„) 

where Vl/ and Vs/ are two matrix gradients. Thus, the overall vector gradient with re- 
spect to (vec(L)-'", vec(S)'^)^ is a column stack of the two gradients, that is vec(VL,s/(L, S)) : 
(vec(VL/)"^, vec(Vs/)^)"^. The gradient is then Lipschitz continuous with a constant ^ = 2, 
that is 



||vec(VL,s/(Li, Si)) - vec(VL,s/(L2, Sa))!!^ < ^^J\Ll - U\l + |Si - ^2\l (16) 
for any Li, Si, L2, S2 G W^'^. For simplicity, denote the LOREC objective function by 

1 



Similar to 



F(L, S) = - |L + S - S„|^ + A||L||, + p|S|i. 



Ji and Yd (l2009l ). we seek the iterative update (L(t), S(()) given (L(t_i), S((_i)^ 



by minimizing the following objective over (L, S) 

S), (L(i_i), S(t_i))) = /(L(t_i), S(t_i)) + (VL/(t-i), L - L(i_i) 



> + (Vs/(i-i),S-S(i_i)> 



+ 2!-'^ ^ L(t_i)|^ + -|S — S(t_i)|^ + A||L||* + p|S|i 

where (Vl/^, Vs/(t)) := (VL/(L(t), S(t)), Vs/(L(t), S(t))). Instead of adaptively search for 
/, one can simply choose a fixed stepsize / > £ = 2, as we will do in this paper. The 
initializer (L(o), S(o)) can be simply set to be (diag (S„) , diag (Sn)) /2. A key observation 
is that the objective Qi is separable in L and S, and thus it is equivalent to seek L and S 
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respectively in the following two objective functions: 



L(t) 
S(t) 



argmin{-|L - (L^t-i) - -jV f(^t-i))\F + ^l|L||*} 

L ^ t 

argmin{-^|S - (S^t-i) - f{t-i))\l + P\%}- 
s ^ t 



(17) 
(18) 



Objective (fT7|) can be solved easily by soft-soft-thresholding the singular values o f 



see also 



l/OV.f(<:- i )i and this is summarized in the following lemma from 



Cai et al. 



(l2010af ). 



Ji and Yel fl2009f l. 



Lemma 6 (Ji and Ye (2009), Cai et al (2010)). Let Y G 7^™^" and let Y = UDV^ by 

the SVD ofY where U G TZ^^^ and V G T^"^*" have orthonormal columns, D G 7?.'"^'' 
diagonal, and r = rank(Y). Then 

TxiY) :=argmin{^|M-Y|| + A||M||J 
M 2 

is given by T\{Y) = UDaV"^, where Da is diagonal with (Dxjii = max{0,Djj — A}. 

Objective (ITSl) can be solved by soft-thresholding of each entry in S(f_i) — (l//)V/(t_i), 
which is summarized by the following lemma. The proof is straightforward and thus omit- 
ted. 



Lemma 7. Let Y eW' 



Then 



1 



TpiY) := argmin{-|M - Y||, + A|M|i} 



M 2 

is given element-wise by {Tp{Y))ij = sign(yij) max{0, \Yij\ — p}. 

The stepsize / can be adaptively chose n by the following lemma. This lemma simply 



Ji and Yd (|2009[ ) to our composite penalization. This lemma is 



extends Lemma 3.1 from 
also needed for proving Theorem [9l 

Lemma 8. Let 

(L, S) = di{t, S) = argminQ,((L, S), (L, S)). 

(L,S) 

// the following stepsize assumption is satisfied for some I > 

F(L,S)<g,((L,S),(LS)). 

Then for any (L, S) we have 

I 



(19) 



F(L, S) - F(L, S) > /(L - L, L - L) + /(S - S, S - S) + -|L - L||. + -|S - S 
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By the Lipschitz continuous property ( fT6|) for all (L, S), a fixed choice I > i = 2 shall 
make the equation (IT^ always satisfied, but an adaptive stepsize satisfying Lemma E] could 
give better numerical convergence in practice. A line search strategy then could be used to 
adapt / from initializer /(q). We didn't implement this line search in the current release of 
our R package, as the convergence is already very fast in practice. 

Consequently, the extended gradient method for LOREC is summarized in Algorithm [TJ 



Algorithm 1 Extended gradient method for LOREC. 
Initialize L L(o), S ^ S(o) and / = /(q). 
repeat 

while F(rf,(L(t_i), S(f_i))) > Qi{di{L^t-i),Sit-i)),'i^{t-i),S(t-i)) do 

/ ^ al 
end while 

S(()) = ci;(L(t_i), S(i_i)), k = I 
until convergence 



We now establish the global convergence rate of Algorithm [T] by the following theorem. 

Theorem 9. Let (L(j),S(t)) be the update generated by AlgorithmlJ\ at iteration t. Then 
for any t > 1, we have the following computational accuracy bound 



F(L(,),S(t))-F(L, S) <2J 



^(0) 



If + |s 



(0) 



|2 

If 



It is known that the above convergence for Algorithm [T] can be improved by mixing two 
consequent updates. This accelerated gradient version is given by Algorithm [2l 

We es tablish also the global co n vergence rate of Algorithm [2l The proof is standard 
following iBeck and TebouUd ( l2009l ): |Ji and Yd ( 120091 ). and is thus omitted. 

Theorem 10. Let (L(f), S(t)) be the update generated by Algorithmic it iteration t. Then 
for any t > 1, we have the following computational accuracy bound 



F(L(i),S(i))-F(L,S) < 



^(0) 



l^ + |S 



(0) 



12 

If 



(t + l) 



A crude order of computational cost is 0{p'^/^/e). Simply, the operations at each 
iteration is 0{p^) due to full SVD, which is the same order of computation of least squares 
with p variables. This can be easily improved using partial SVD methods as discussed in 
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Algorithm 2 Accelerated gradient method for LOREC. 



Initialize L ^ L(o) = Y(i), S <— S(o) = Z(i) and / = /(o). 
repeat 

while F(rfi(Y(t_i),Z(i_i))) > g,^(Y(t_i),Z(i_i)),Y(t_i),Z(t_i)) do 

/ ^ al 
end while 

ht) = ^ 

(L(t), S(j)) = di(Y(t_i), Z(t_i)) 
Q^t+i — 2 

(Yi+1, Z(j+i)) = (L(t), S(f)) + (|^)[(L(j), S(t)) - (L(t_i), S(t_i))] 
until convergence 



Section [71 The numerator in the bound above is at most O(p^), and the crude bound for 
an e-optimal solution is immediate. For a comparison, the linear progra mming for O(p^) 



Cai et al. 



variables takes 0{p^/ log(e)) operations using interior point methods, as in 
for the inverse covariance problem. Algorithm [2] is 0{p'^) faster than linear programming 
if the precision requirement (e) is not high. 



5 Numerical Studies 

In this section, we compare the performance of LOREC with other methods using simulated 
and real data examples. An R package has been developed and is publicly available through 
CRAN. 



5.1 Simulation 

The data is simulated from the following underlying covariance models: 

1. factor: S* = UDU"^ + I where U G MP^^ with orthonormal columns uniformly 
random, and D = diag(8, 8, 8). 

2. compound symmetry: random permutation of E* = 0.211^ + S, where the sparse 
component is block diagonal with each square block matrix B of dimension 5, and 
B = 0.411^ + 0.41. 
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3. spike: S* = (3uu^ + S*with (3 = 16, \{j : uj ^ 0}\ = p/2 and S*a block diagonal 
matrix. Each block in S* is 0.411^ + 0.61 of size 4. 



We com pare the perfori nance of LOREC with t he sample covariance, thresholding (iBickel and Levina 
2008b[ ) and Shrinkage jledoit and Woli hoO^ . 



For each model, 100 observations are generated from multivariate Gaussian distribution. 
The tuning parameters for the LOREC, thresholding and shrinkage e stimators are pick ed by 



5-fold cross validation using the Bregman loss, which is also used in lCai et al.l (120111 ). The 
sample covariance estimator does not have a tuning parameter, and is calculated using the 
whole 100 samples. The process is replicated by 100 times with varying p = 120,200,400. 

Table [1] illiterates the comparison of performance measured by the operator norm, the 
matrix ii norm and the Frobenius norm. The LOREC estimator almost uniformly outper- 
forms all other competing methods on these models under three losses, except the spectral 
loss for the compound symmetry model. In this case, LOREC is the second best after the 
sample covariance. 

The LOREC estimator can also provide recovery of the rank and the sparsity patterns. 
Table [2] lists the frequencies of exact rank and support recovery. The singular values and 
element values are considered to be nonzero if their magnitudes exceed 10~^, since the 
error tolerance is set to 10~^. Note the true rank of the low rank components in these three 
models are 3, 1, and 4 respectively. It is clear that LOREC recovers both the rank and 
sparsity pattern with high frequencies for all the three models. 

5.2 Application to Portfolio Selection 

We n ow compare perfor mance when applying covariance estimation to portfolio construc- 



tion. 



Markowitzl (119521 ) suggested constructing the minimal variance portfolio w by the 



following optimization 

min w^Sw subject to: w'"! = 1 and w'^/i = q 

where /i is the expected return vector and q is the required expected return of the portfolio. 
The solution of the problem suggests the following portfolio weight 

where Ai = l^Hl, A2 = l^S/i, and A3 = fi^Tifi. The global minimal variance return 
without constraining q is obtained by replacing q = A2/A3. The portfolio is constructed 
using information from I], and we shall compare the performance using various covariance 
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Table 1: Average (SE) losses of LOREC, sample and thresholding estimators. The one with the best performance is 
shown in bold. 



Spectral Norm 







Factor 






Compound Symmetry 






Spike 




p 


LOREC 


Sample 


Threshold 


Shrinkage 


LOREC 


Sample 


Threshold 


Shrinkage 


LOREC 


Sample 


Threshold 


Shrinkage 


120 


4.91(0.06) 


5.42(0.07) 


7.35(0.06) 


5.63(0.05) 


7.50(0.21) 


7.15(0.18) 


7.88(0.18) 


7.87(0.23) 


5.88(0.12) 


5.98(0.12) 


6.87(0.18) 


7.75(0.16) 


200 


5.94(0.06) 


6.91(0.08) 


7.84(0.003) 


6.71(0.04) 


11.93(0.33) 


11.26(0.23) 


12.37(0.32) 


13.46(0.42) 


7.20(0.16) 


7.82(0.13) 


14.30(0.14) 


10.38(0.15) 


400 


7.16(0.05) 


10.33(0.07) 


7.92(0.001) 


7.57(0.006) 


23.62(0.65) 


22.00(0.58) 


24.73(0.68) 


27.44(0.79) 


9.80(0.17) 


11.36(0.14) 


15.48(0.02) 


13.43(0.05) 


Matrix Norm 






Factor 






Compound Symmetry 






Spik 






P 


LOREC 


Sample 


Threshold 


Shrinkage 


LOREC 


Sample 


Threshold 


Shrinkage 


LOREC 


Sample 


Threshold 


Shrinkage 


120 


13.19(0.18) 


16.76(0.13) 


21.96(0.18) 


15.92(0.17) 


15.54(0.27) 


16.91(0.28) 


18.95(0.30) 


15.96(0.23) 


11.61(0.15) 


16.29(0.19) 


15.68(0.15) 


14.33(0.14) 


200 


15.12(0.15) 


24.66(0.17) 


21.41(0.02) 


17.91(0.11) 


26.62(0.36) 


28.42(0.35) 


32.25(0.52) 


27.61(0.39) 


14.20(0.15) 


24.53(0.22) 


19.58(0.18) 


17.49(0.09) 


400 


19.93(0.16) 


44.28(0.17) 


24.18(0.01) 


22.71(0.04) 


55.66(0.93) 


59.58(1.07) 


67.56(1.38) 


57.35(0.88) 


17.41(0.15) 


44.16(0.21) 


19.79(0.12) 


19.72(0.06) 



Frobenius Norm 







Factor 






Compound Symmetry 






Spik 






p 


LOREC 


Sample 


Threshold 


Shrinkage 


LOREC 


Sample 


Threshold 


Shrinkage 


LOREC 


Sample 


Threshold 


Shrinkage 


120 


7.99(0.05) 


14.63(0.04) 


13.56(0.02) 


10.14(0.05) 


9.17(0.15) 


12.49(0.09) 


13.89(0.10) 


11.46(0.14) 


8.24(0.07) 


13.85(0.06) 


14.26(0.06) 


11.03(0.07) 


200 


9.62(0.05) 


22.58(0.04) 


13.87(0.002) 


11.94(0.04) 


14.35(0.24) 


20.48(0.11) 


22.91(0.17) 


19.13(0.26) 


10.47(0.09) 


21.87(0.05) 


18.40(0.06) 


14.31(0.07) 


400 


11.81(0.04) 


42.72(0.04) 


14.10(0.002) 


13.21(0.01) 


27.55(0.51) 


41.01(0.34) 


45.91(0.41) 


38.13(0.47) 


14.56(0.10) 


41.92(0.05) 


20.25(0.02) 


18.98(0.03) 



Table 2: Recovered rank and support by LOREC over 100 runs. 

Rank Recovery 



p 


Factor 


Compound Symmetry 


Spike 


%(rank=3) 


mean(se) 


%(rank=l) 


mean(se) 


%(rank=l) 


mean(se) 


120 


80 


3.42(0.09) 


71 


1.52(0.09) 


69 


1.42(0.08) 


200 


99 


3.01(0.01) 


82 


1.46(0.11) 


78 


1.22(0.04) 


400 


91 


2.91(0.03) 


80 


1.56(0.13) 


95 


1.01(0.02) 


Support Recovery 




Factor 


Compound Symmetry 


Spike 


P 


%TP 


%TN 


%TP 


%TN 


%TP 


%TN 


120 


100.0(0.0) 


99.97(0.01) 


99.95(0.02) 


91.96(0.29) 


97.59(0.19) 


92.31(0.18) 


200 


100.0(0.0) 


100.00(0.001) 


99.91(0.02) 


94.76(0.18) 


97.04(0.20) 


94.42(0.14) 


400 


100.0(0.0) 


100.0(0.0003) 


99.81(0.02) 


97.13(0.09) 


94.28(0.30) 


97.57(0.14) 



estimators. In addition to the covariance matrix estimators considered in t he simulation, we 
have a lso included the another shrinkage estimator from the experiments in 



Ledoit and WoU 



(120031 ): "shrink to market". The details of this estimator can be found therein. 

The monthly returns of stocks listed in SP 100 (as of December 2010) are extracted 
from Center for Research in Security Prices for the period from January 1972 to December 
2009. We remove the stocks with missing monthly returns within the extraction period 
due to company history or reorganization, and 53 stocks are retained for the following 
experiment. All monthly returns are annualized by multiplying 12. The sample covariance 
and correlation are plotted in Figure [TJ Most of the entries are not small and negligible. 
Similar observations (not shown) can be drawn after removing risk-free returns. 

We conduct the following s trategy in building our portfolios and evaluate their fore- 
casting performance, similar to Ihedoit and Woli (2003). On the beginning of testing year 
y, we estimate the covariance matrix from the past 10 years from January of year y — 10 
to December of year y — 1, and then we construct the portfolio using (pOj) . and hold it 
throughout the year y. The monthly returns of the resulting portfolio are recorded. In 
thresholding, shrinkage and LOREC, we pick the parameters that produce the smallest 
overall variance out-of-sample for each testing year from y — 5 to y — 1 hj constructing 
portfolios using their past 10 year covariance estimates. Table |3] shows that LOREC almost 
outperforms all other competing methods. It is only slightly worse by a nominal amount 
than "shrink to market" when the restriction q is set to be 10%. 

LOREC can also recover the decomposition of the low rank and sparse components in 
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Table 3: Average (SE) of realized return variances for year 1987-2009. 





Unconstrained 


10% Constr. 


15% Constr. 


20% Constr. 


Sample 


0.379(0.049) 


0.403(0.054) 


0.370(0.047) 


0.377(0.046) 


Thresholding 


0.376(0.047) 


0.400(0.052) 


0.365(0.045) 


0.371(0.044) 


Shrink to identity 


0.284(0.035) 


0.290(0.037) 


0.263(0.033) 


0.265(0.034) 


Shrink to market 


0.249(0.029) 


0.272(0.034) 


0.249(0.030) 


0.251(0.031) 


LOREC 


0.227(0.038) 


0.278(0.048) 


0.236(0.041) 


0.221(0.038) 



stock data. Indeed, LOREC identifies a rank one component, existing in the covariance of 
each 10-year periods ending from 2001 to 2009, but n one for the ear lier years. This single 
rank finding is consistent with the empirical finding of lOnatskil (120091 ) on a similar dataset. 
As discussed in Section 13. 2[ the singular vector of an rank 1 component is equivalent (up 
to a multiplying constant) to the loading in a single facto r mode l . Alt ernatively, using a 
single factor model. Capital Asset Pricing Model (CAPM) ( ISharpd . 119641 ) . the loading (also 
called beta's) can be estimated with additional proxies. Here the usual choice of market 
returns is used. In Figure [2l we compare the loading estimates from LOREC and CAPM 
by plotting the cosine of the angles between these two vectors for year 2001-2009. The 
plot shows that the angle of these two loading vectors is close to consistently, almost 
monotone decreasing from 27.33 to 9.90 degrees. It suggests that LOREC verifies a similar 
loading as CAPM, though it does not need to observe the market proxy like CAPM. It 
is also interesting to notice that the trend that the angles between these two estimates 
become smaller as time progresses. 

Besides the low rank component recovered, LOREC provides a sparse component es- 
timate for each year, and the Heatmap of support recovery through 2001-2009 is plotted 
in Figure [31 For instance, a high correlation is recovered between Texas Instrument (TI) 
and Hewlett Packard (HP) through all these years. A possible explanation is that they are 
both involved in the business of semiconductor hardware, and they can be infiuenced by 
the same type of news (fiuctuation). Other stock pairs of significant nonzero correlations 
are also examined, and they seem to reasonable within this type of explanation. 
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Figure 2: Correlations and angles between the loading vectors recovered by LOREC and 
CAPM. 




2002 2004 2006 2008 2002 2004 2006 

year year 



6 Proofs 



6.1 Proof of Theorem [T] 



The proof modifies the arguments used in IChandrasekaran et al.l (120 10[ ) (hereafter CPW). 



The notation used here is the same, but our objective function is different. A key difference 
in analyzing our method is that the Frobenius norm approximation is used instead of the 
likelihood loss. Thus the fisher information matrix is simply an identity operator X* = X. 
Consequently, the residual term of their second order approximation is not needed in our 
proof. For simplicity of the presentation, we rescale the penalization parameters in ([9]) to 
have the following equivalent form 

min I J |L + S - + Xn [||L||, + 7|S|i]l (21) 



S,L [2 

where 7 = p/A > and A„ = A. 

The proof outline is to show that the solution of f l2T]) is equivalent to the solutions 
the objective function under additional constraints, which we will introduce later. The 
statistical properties of the solution are then analyzed with these additional constraints. 
Because these constraints are shown to be inactive at the optimum, previous analysis will 
also apply for the unconstrained one ( 12T|) . 
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Figure 3: Heatmap of the support of the error component by LOREC from 2001-2009. 
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The performance of LOREC is measured using the dual norm of the composite penalty 

^^(S,L) =max{|S|^/7, ||L||}. 

It is essential to characterize the performance under a slight perturbation of the space T. 
The perturbation strength is measured using the the twisting measure 

max llfT^T.-T^T,] (N)|| 

where Vt is the projection operator onto the space T. Denote the addition operator by 
A-i and its adjoint operator will be denoted by . The following proposition plays a 
fundamental role in our analysis, which parallels Proposition 3.6 in CPW. The proof is 
similar to CPW, and thus it is omitted here. 

Proposition 11. Suppose that n{9)^{T) < 1/54 and 7 e [9^(7), l/(6/x(Q))]. Then we 
have the following conclusions for y ^ D, x T' with p{T, T') < ^(T) /2; 

1. The minimal gain is bounded below as 

And this implies that for all (S, L) e y , 

g,{Vy.A^Ary{S,l.))>\g,{SM- 

2. The effect of elements in y — x T' on the orthogonal complement y-^ — Q-*- x T'-*- 
is bounded above as 



1 

< -. 



Vy.AUVy {VyAUVyy' 

And this implies that for all (S, L) e y, 

g,{Vy.AUVy{S,L)) < lg,{ryAUVy{S,L))- 
First, we analyze the following objective with additional tangent space constrains: 



inin I ^ |L + S - + A„ [||L||* + 7|S|i] 

S,L I z 

s.t. Sen LeT' 



(22) 



where T = T(L*) such that p(T, T') < ^{T)/2. The follow proposition parallels Proposition 
D.2 in CPW. 
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Proposition 12. Let p{T,T') < ^(T)/2 and define 

r = max {4 [g^iA^En] + g^iA^Cr') + A„] , HCt'L} 

where Ct' = Pj'/x(L*) and 'En = S* — S„. T/ien t/ie solution {Sn, i^T') of fl22|) satisfies 

g,{Sn-S*,LT'-L*)<2r. 

Proof. Let (As, Al) = (S^ — S*, Lt' — L*). The convex objective at the optimum (Sn,Lr') 
satisfies, for some Lagrange multipliers Qq± G fi^ and Q^^/x G T'-*-, such that 

Sq + Lt' — S„ + Qq± g —Xn'yd Sn , Sn + L^' — Sn + Qt'^ ^ — 

1 

Projecting these quantities onto the space y = Q x T' , and define 

Pq(Sq + Lj-/ — S„) = Zn, 'PT'(Sn + Lt' — S„) = Z^/. 

It verifies that |Zq|^ = A„7 and ||Zr/|| < 2A„. Denote Z = (Zn,Zr'); and it is easy to 
check 

VyA^{Sn + LT'-^n) = Z. (23) 
Reformulating the components in the bracket, 

Sn + Lt' - S„ = E„ + AVyiAs, Al) - Ct'. (24) 

Let 

^(^S, 5l) = (5s, 5l) - {VyAUVyr'{VyA^[E„, + ^^^(^s, ^l) - Ct'] - Z}. 

A point {Ss,5i,) e y is a fixed point of J" if and only if VyA^[En + AVy{5s, 5l) - C^'] = Z. 
Vy{As, Al) is thus a fixed point in 3^ in light of ( !23|) and (p^ . and the uniqueness follows 
from that it is an unique solution of (122|) . The following arguments show that this unique 
fixed point lying within in the ball = {(^S; (^l) I d-yi^s, ^l) ^ f^, {^s, <^l) £ 3^}- 
In order to apply Brouwer's fixed point theorem, it verifies that 

9jiF{6s, 5l)) = g,iiVyAUVy)-'{VyA^[-En + C^^] + Z}) 

<2g^{PyA^[-En + CT'] + Z) 

<A{g,{A^[-En + CT']) + Xn) 
< r. 

Thus, g^(Vy{As, Al)) < t by the fixed point theorem. Therefore, 

(?^(As, Al) < (7^(Py(As, Al)) + ||Cr,|l2 < 2r. 

□ 
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In order to verify that our solution recovering the underlying rank and sparsity pattern, 
we consider explicitly enforcing such a constraint by the set 



M = {(S,L)|SGf](S*), rank(L) < rank(L*), \\Vt±{L - L*)\\^ < ^{T)Xn 

(25) 

g^iA^AiS - S*, L - L*)) < llA,}, 
and denote the minimizer (S_a4,La^) of the following constrained optimization problem 



min I ^ |L + S - S„|^ + A.„ [||L||^ + 7|S| 

S.L I 

s.t. (S,L)eM. 



(26) 



Proposition D.3 and Corollary D.4 in CPW involves on the properties of this constrain set 
Ai only regardless of the objective function, it also holds for our problem here by setting 
the parameters therein a = f3 = ip = D = 1, 6 = 0, u = 1/2, C5 = 24010, and Cq = 49/3. 

The equivalence of two types of constraints discussed before can also be established, by 
studying the spacial case of the tangent space constraint optimization. Define Tm = T{Lm) 
and the optimization in (l26l) is equivalent to the following under suitable conditions: 



(27) 



min <( i |L + S - S„,|^ + A„ [||L||^, + 7|S| 



S,L [2 

s.t. SeQ L e Tm. 

The conditions we need comes from the following proposition. 

Proposition 13. Let 7 G [9^(T), l/(6/i(r2))], and suppose the minimal singular value of 
L* is greater than and the minimal nonzero entry of S* is greater than where 

C5 = 24010 and C& = 49/3. Suppose g^{A^En) <^.Th en we have that 

(Sn,L'r^) = {Sm,^'m)- 
Proof. Similar to D.9 in CPW, we only need to verify that (Sq, L^^) satisfies the following 

g^UiSn - S*,Lt^ - L*)) < 11A„. 
Let (As, Al) = (Sn — S*, Ly^ — L*). The optimality condition implies that 

20 

g^{VyA^AVy{As, Al)) < 2 [A„ + g^^En) + g^iA^CrJ] < —Xn 

28 



where we have used g^(A^'Ein) < and g^/{A^CTj^) < 4^. Therefore, 



g.iA^AiAs, Al)) < g,{VyA^AVy{ As, A^)) + g,{Vy.AUVyiAs, A^)) + gM^CrJ 

20, 10, A„ 
y y io 



□ 



This proposition estabhshes the equivalence that rank(Lr^) = rank(L*) and T(L 



Tm- 

Finally, we need to show that (Sf7,LT^) is also the solution of the original problem 
( !2T|) . without the tangent space constraints. 

Proposition 14. Let 7 G [9^(T), l/(6/i(fi))], and suppose the minimal singular value ofh* 
satisfies a > o,nd the minimal nonzero entry of S* is greater than where C^and 

Cq are defined in Proposition [731 Suppose g^{A'^'E'n) < ff- Then we have that {Sq,,\^Tm) 
is optimal of the original problem f l^T]) . 



Proof. The optimality condition of (l27j) implies that there exist Lagrange multipliers Qj^± G 
fi-*- and Qjix G T;^^ such that 



Sq + L^' — 5]„ + Qj.x G — A„9 



Restricting to the space y = Vt ^ Tm-, we have 

VyA\Sn + tTj^-^n) = ^ (28) 

where Z = (-A„7 sign(S*), -A„UV^) and Lt^ = UDV^ is a reduced SVD of Lr^. Thus 
5f^(Z) = An- All we need to show is that 



g^{Vy^A'^ 



Sf7 + 



Let (As, Al) = (Sf7 - S*, L^^ - L*). Recall that 



S„ = E„ + ^Pv(As,Al)-C 



'A1 ' 



and then we have from (128 



Py^UP3;(As, Al) = Z + VyA^-'E.n + Ct^ 
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Reorganize terms to verify that 



Therefore, it is equivalent to show 



g^iVy^AUVyiAs, Al)) < A„ - g^iVy^A^ {E„ - Ct^}). 
The above follows from Proposition [TTl 

g,iVy.A^AVyiAs,A^)) < ^g,iVyAUVy{As, A^)) 



< 



\n + 2 — 

9 

7 



11 ^ 

< —A, 
- 18 



< A„ — — 

io 



< Xn-g^{Vy±A^ {Er^ + CT'}). 



□ 



Put it all together, under the assumptions before, we only need to verify that (7^(^^E„) < 
A„/18. Standard probability argument, see (58a) from IWainwrightl (120091 ) . shows that 



P (^||E„||2 > Ci^l^ < C2expi-Csp) 



(29) 



where Ci may depend on A^ax- It is also standard, see iBickel and Levinal ( l2008bl ). 



P\\\En\L>cJ'^]<C,p'-^ 



n 



(30) 



Therefore, with probability greater than 1 — C^p 



g^i^A'^En) < Cmax 



e(T)v n 

The bound r in Proposition [12] is then bounded by 

40 



logp jp 

n 



9 



■A„. 
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6.2 Proof of Corollary [2] 

It is a fact that 



L + S - L* - S* 


< 


L 


-L* 


+ 


S 


- s* 




< 


L 


-L* 


+ 


S 


- s* 



< 



L-L* 



S-S* 



where we have used the fact that sign(S) = sign(S*) in the last inequahty. Setting 7 = 
9^(T), equation (fTU]) is obtained by applying the the bounds in Theorem [T] for the two 
components above. 

The second part follows that 



L + S-L*-S* 


< 


L - 


L* 


+ 


S-S* 






F 






F 




F 




< 




L 


-L* 


+ ^fps 



where we have used in the last line the fact that L — L* has at most rank 2r. Again applying 
the bounds in Theorem [1] to yield the desired bound . 

6.3 Proof of Corollary [3] 

The first bound follows that 



:l + s)-i-(s*) 



< 



< 



(L + S)-i L + S-S 
(L + S)-i L + S-S 
L + S-S 



i*\ — 1 



—1 I 



A2. 

mm 



< 2A^:„ because of (fT2l) and the 



where the last inequality uses the fact that (L + S)^^ 
assumption Amin > 2$. Substitute the bound ( ITOi) to the above to reveal the result. 

It is a fact that \M1M2\p < \\Mi\\ \M2\p for matrices Mi and M2 of proper sizes. This 
fact implies that 
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i*\ — 1 



< 
< 



A2. 

mm 



L + S - S* 
L + S - S 
L + S - S 



(L + S)-i 
2 



The rest of the proof is similar to the first part. 



6.4 Proof of Corollary \5\ 

The entries [S* — S*]^^. has the following values 



(3/k iit=j 

(3 sign(MjMj) /k Hi ^ j and UiUj ^ 

otherwise 



The nonzero entries will be of the mag nitude 0(fcn-i/Mog^/2p). The nonzero entries of 
S* will be at least [kj s)n~^l'^\o^^'^ j). Moreover, the overall sparsity is upper bounded by 
+ s. Thus, applying the hard thresholding rule at level C\n~^^'^ log^^^ p will produce an 

'ollowing bounds with probably greater than 1 — Cip"*"^, 



estimator that satisfies the 



due to 



Bickel and Levinal fl2008br ). 



|St„-S*L<C3A/^ and ||Si^ - S*|| < C4(fc + s) 



n 



n 



(31) 



The rest of the proof follows by replacing the probability bounds (l29l) and (1301) in the proof 
of Theorem [1] with ( I3T1) . 



6.5 Proof of Lemma [8] 



This is similar to the proof of Lemma 3.1 from I Ji and Yd (120091 ). Since the function /, the 
nuclear norm and the ii norm are convex, we have 

/(L, S) > /(L, S) + (L + S - L - S, V/(L, S)) 
A||L||, > A||L||, + A(L-L,9||L||,) 
p|S|i>p|S|i + p(S-S,9|S|i) 
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where V/(L, S) := Vl/(L, S) = Vs/(L, S), 9||L||* and d\S\i are the matrix subgradients 
of the nuclear norm and the ii norm respectively, at L and S. Summing up all above to 
yield 

F(L, S) > /(L, S) + (L + S - L - S, V/(L, S)) 

+ A||L||, + A(L - L,9||L||,) +p|S|i + p(S - S,d\S\i) 

> QiiiU S), (L, S)) + (L - L, V/(L, S) + A9||L||,) + (S - S, V/(L, S) + pd\%) 
--|L-L||--|S-S||. 
Therefore, under the assumption (fT9l) . it follows that 
F(L, S) - F(L, S) > F(L, S) - QiHU S), (L, S)) 
>(L - L, V/(L, S) + A9||L||,) + (S - S, V/(L, S) + p9|S|i) - 1|L - L|| - ^|S - S\l 



>/(L-L,L 


-L) + Z(S-S,S 


-s)- 


2' 


~ L 1 - - S - S\p 


>/(L-L,L 


-L) + /(S-S,S 


-s) + 


2' 





where we have used the fact in the third line above that for any (L, S) 

(L - L, V/(L, S) + A9||L||, + /(L - L)) + (S - S, V/(L, S) + pd\S\i + /(S - S)) > 0. 

6.6 Proof of Theorem [TO 

The proof is standard for Nesterov's methods. Applying Lemma |8] with (L, S) = (L,S), 
(L, S) = (L(t), S(t)), and / = l(t+i) < ai, it yields that 

2 ^ ^ - , , ^ 
(-^(L, S) - F(L(t+i), S(f+i))) > |L((+i) - L|^ + |S(f+i) - S\p - \L(^t) - Mf ~ \^(t) - S|p. 

Summing over t = 0, . . . , A; — 1 to reveal 

k~i „ 

^(F(L(,_,i), S(,+i)) - F(L, S)) < ^(|L(o) - L|| + |S(o) - S|^). 
t=o 

Therefore, the fact F(L((_(_i), S(t+i)) < -F(L(j), S(t)) and F(L(fc), S(fc)) > F(L, S) implies that 

fc(F(L(,,),S(,))-F(L,S)) < ^(F(L(,+i),S(,+i))-F(L,S)) < ^(|L(o) -L|| + |S(o) -S||). 
Replace £ by 2 to prove the claim. 
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7 Discussion 



In this paper, we propose a new covariance structure framework that can be verified in 
many popular statistical models, and consequently a new estimator, LOREC, is proposed 
to exploit this structure. The statistical performance is theoretically justified using various 
losses, and the computation is proven to enjoy fast convergence. The merits of this new 
approach is illustrated using numerical examples. 

One direction is to further understand the theoretical properties under various norms, 
especially the optimal rates. The Frobeni us norm result here is c onjec tured to be minimax 
optimal using a similar argument from iRohde and Tsybakovl ( l2010l ). but it remains to 
be rigorously iusti fied. Moreover, th e convergence rates are shown to be different under 
different norms by 



Cai et al. 



(l2010bl ). and it is interesting to study if such a phenomenon 
also exists under our covariance structure. 

The results can be further improved with stronger structural assumptions. For example, 
we consider the factors to be unobservable here. If one suppose reasonable proxies are 
available, it is an interesting problem to incorporate such additional information in our 
approach, and improvements may be achieved hence. Moreover, the conditions imposed 
are to derive both support/rank recovery and matrix loss bounds. If one is interested in 
just the latter goal, without separating the two components, we suspect a different set of 
conditions shall suffice. Different conditions should also apply if one is only interested in 
determining the number of factors as in signal processing. It is also interesting to explore 
other special cases of our general results. We will report the results for multiple spike 
covariance model elsewhere. 

The choice of tuning parameters can be chosen theoretically as in the main results or 
chosen using the cross validation in p ractice. It is of great int erest to study data-driven 
choices. We found the development in 



Johnstone and Lu 



is particularly promising. 
The convex optimization approach is adopted for computational efficiency. It is inter- 
esting to study adaptive versions of our proposal to ameliorate the bias. For example, 
one may cons i der re placing the penalty func tion pen (•) i n ([9]) by either the adaptive lasso 
penalty (Zou, 2006 ) or the SCAD penalty (jFan and Li . 2001 ). We leave this to further 
study. It is also interesting to study constrained ve rsions of our ap proaches, which has 



shown some improvements in estimating the inverse (jCai et al. 
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The algorithms we propose may incorporate additional numerical tricks to improve the 
computation speed. Currently full SVD is used in our implementation for updating the low 
rank component. In such situations, a partial SVD, such as partial Lanczos SVD, should 
suffice, which will provide additional numerical savings in large scale problems. We plan 
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to implement this in future releases of our package. 
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